To minimize losses for the bank, we need to predict the likelihood that a company will go bankrupt so that the bank can divest its position before the bankruptcy occurs.
The team evaluates both Random Forest and gradient-boosted trees (scikit-learn's GradientBoostingClassifier, used here as an XGBoost-style learner). Because boosting fits each new tree to the remaining error, it can handle our unbalanced target class better than a Random Forest. In preparation for modeling, we loaded all of the .arff files into one dataframe and decoded the attributes, which arrive as byte strings. The data was cleansed and scaled prior to modeling.
The use case requires that accuracy be reported as a metric; however, due to the unbalanced target class we will also measure precision, recall, and weighted F1 as truer measures of the model's ability to predict bankruptcy.
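For reference, a minimal sketch of how these metrics are computed with scikit-learn; the toy labels below are placeholders for illustration only, not outputs of the models in this notebook:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
# Placeholder labels for illustration; real values come from the fitted models below.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))   # of predicted bankruptcies, how many were real
print("recall:     ", recall_score(y_true, y_pred))      # of real bankruptcies, how many were caught
print("weighted F1:", f1_score(y_true, y_pred, average='weighted'))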
import os
import pandas as pd
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from scipy.io import arff
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from yellowbrick.model_selection import FeatureImportances
import seaborn as sns
from sklearn.metrics import (confusion_matrix, recall_score, precision_score,
                             f1_score, classification_report, roc_auc_score,
                             accuracy_score, average_precision_score,
                             precision_recall_curve)
import warnings
warnings.filterwarnings('ignore')
data_files = os.listdir("./data/")
data_files
frames = []
for file_name in data_files:
    path = os.path.join("./data/", file_name)
    print(path)
    data = arff.loadarff(path)                  # returns (records, metadata)
    frames.append(pd.DataFrame(data[0]))
my_data = pd.concat(frames, ignore_index=True)  # DataFrame.append is deprecated in favor of concat
my_data['class'] = my_data['class'].str.decode('utf-8')  # arff loads nominal values as bytes; decode the target
my_data = my_data.rename(columns={'Attr1': 'net profit / total assets',
'Attr2': 'total liabilities / total assets',
'Attr3': 'working capital / total assets',
'Attr4': 'current assets / short-term liabilities',
'Attr5': '[(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365',
'Attr6': 'retained earnings / total assets',
'Attr7': 'EBIT / total assets',
'Attr8': 'book value of equity / total liabilities',
'Attr9': 'sales / total assets',
'Attr10': 'equity / total assets',
'Attr11': '(gross profit + extraordinary items + financial expenses) / total assets',
'Attr12': 'gross profit / short-term liabilities',
'Attr13': '(gross profit + depreciation) / sales',
'Attr14': '(gross profit + interest) / total assets',
'Attr15': '(total liabilities * 365) / (gross profit + depreciation)',
'Attr16': '(gross profit + depreciation) / total liabilities',
'Attr17': 'total assets / total liabilities',
'Attr18': 'gross profit / total assets',
'Attr19': 'gross profit / sales',
'Attr20': '(inventory * 365) / sales',
'Attr21': 'sales (n) / sales (n-1)',
'Attr22': 'profit on operating activities / total assets',
'Attr23': 'net profit / sales',
'Attr24': 'gross profit (in 3 years) / total assets',
'Attr25': '(equity - share capital) / total assets',
'Attr26': '(net profit + depreciation) / total liabilities',
'Attr27': 'profit on operating activities / financial expenses',
'Attr28': 'working capital / fixed assets',
'Attr29': 'logarithm of total assets',
'Attr30': '(total liabilities - cash) / sales',
'Attr31': '(gross profit + interest) / sales',
'Attr32': '(current liabilities * 365) / cost of products sold',
'Attr33': 'operating expenses / short-term liabilities',
'Attr34': 'operating expenses / total liabilities',
'Attr35': 'profit on sales / total assets',
'Attr36': 'total sales / total assets',
'Attr37': '(current assets - inventories) / long-term liabilities',
'Attr38': 'constant capital / total assets',
'Attr39': 'profit on sales / sales',
'Attr40': '(current assets - inventory - receivables) / short-term liabilities',
'Attr41': 'total liabilities / ((profit on operating activities + depreciation) * (12/365))',
'Attr42': 'profit on operating activities / sales',
'Attr43': 'rotation receivables + inventory turnover in days',
'Attr44': '(receivables * 365) / sales',
'Attr45': 'net profit / inventory',
'Attr46': '(current assets - inventory) / short-term liabilities',
'Attr47': '(inventory * 365) / cost of products sold',
'Attr48': 'EBITDA (profit on operating activities - depreciation) / total assets',
'Attr49': 'EBITDA (profit on operating activities - depreciation) / sales',
'Attr50': 'current assets / total liabilities',
'Attr51': 'short-term liabilities / total assets',
'Attr52': '(short-term liabilities * 365) / cost of products sold',
'Attr53': 'equity / fixed assets',
'Attr54': 'constant capital / fixed assets',
'Attr55': 'working capital',
'Attr56': '(sales - cost of products sold) / sales',
'Attr57': '(current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)',
'Attr58': 'total costs / total sales',
'Attr59': 'long-term liabilities / equity',
'Attr60': 'sales / inventory',
'Attr61': 'sales / receivables',
'Attr62': '(short-term liabilities *365) / sales',
'Attr63': 'sales / short-term liabilities',
'Attr64': 'sales / fixed assets'})
my_data.head()
my_data.describe()
colms = np.where(my_data.isnull().sum() > 0)   # indices of columns containing missing values
print(colms)
my_data = my_data.fillna(0)                    # simple imputation: replace missing values with 0
colms = np.where(my_data.isnull().sum() > 0)   # verify no missing values remain
print(colms)
my_data.groupby(['class']).mean()
sns.boxplot(x='class', y='profit on operating activities / financial expenses', data=my_data).set(title='Boxplot for profit on operating activities / financial expenses')
sns.boxplot(x='class', y='(gross profit + depreciation) / total liabilities', data=my_data).set(title='Boxplot for (gross profit + depreciation) / total liabilities')
sns.boxplot(x='class', y='gross profit (in 3 years) / total assets', data=my_data).set(title='Boxplot for gross profit (in 3 years) / total assets')
my_data_top = my_data[['profit on operating activities / financial expenses', '(gross profit + depreciation) / total liabilities',
'gross profit (in 3 years) / total assets','profit on sales / sales','(current assets - inventory) / short-term liabilities',
'(net profit + depreciation) / total liabilities','(gross profit + depreciation) / sales','retained earnings / total assets',
'total liabilities / ((profit on operating activities + depreciation) * (12/365))','profit on operating activities / sales']]
data_corr = my_data_top.corr() # grabs correlation variables of features
mask = np.zeros_like(data_corr, dtype=bool) # returns array of zeros w/ same shape and type of given array
mask[np.triu_indices_from(mask)]= True # Generate a mask for the upper triangle
f, ax = plt.subplots(figsize=(11, 9)) # Matplotlib figure setup / formats nicely
sns.heatmap(data_corr,
            mask=mask,
            square=True,
            cmap='coolwarm',        # easier visualization of correlated variables
            annot=True,
            annot_kws={'size': 12})
ax.set(title='Heatmap of Features with Significant Importance to our Models')
X = my_data.drop("class" , axis=1)
y = my_data['class']
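Before resampling, it is worth confirming how unbalanced the target actually is; a quick check (the exact counts depend on the loaded data):
# Inspect the class balance that motivates the resampling below.
print(y.value_counts())                 # raw counts per class
print(y.value_counts(normalize=True))   # proportions per class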
under_sample = RandomUnderSampler(sampling_strategy='all')   # downsample every class to the minority-class count
X_under, y_under = under_sample.fit_resample(X, y)
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size=.30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
scaler = StandardScaler()
X_train_scaler = scaler.fit(X_train)             # fit on the training split only to avoid leakage
X_train_scaled = X_train_scaler.transform(X_train)
X_train_scaled = pd.DataFrame(data=X_train_scaled, columns=X_train.columns)
X_test_scaled = X_train_scaler.transform(X_test)
X_test_scaled = pd.DataFrame(data=X_test_scaled, columns=X_test.columns)
scaler_u = StandardScaler()                      # a separate scaler instance for the undersampled split
X_train_scaler_u = scaler_u.fit(X_train_under)
X_train_scaled_u = X_train_scaler_u.transform(X_train_under)   # was using the full-data scaler by mistake
X_train_scaled_u = pd.DataFrame(data=X_train_scaled_u, columns=X_train_under.columns)
X_test_scaled_u = X_train_scaler_u.transform(X_test_under)
X_test_scaled_u = pd.DataFrame(data=X_test_scaled_u, columns=X_test_under.columns)
dTree = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
dTree = dTree.fit(X_train_scaled_u, y_train_under)
y_pred = dTree.predict(X_test_scaled_u)   # predict on the scaled test set the model was trained against
accuracy_score(y_test_under, y_pred)      # accuracy_score expects (y_true, y_pred)
cnf_matrix = confusion_matrix(y_test_under, y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(y_test_under, y_pred))
abcl = GradientBoostingClassifier(n_estimators=35, criterion='friedman_mse', random_state=1999)  # 'mae' is no longer a supported criterion
abcl = abcl.fit(X_train_scaled_u, y_train_under)
y_pred_xgb = abcl.predict(X_test_scaled_u)   # predict on the scaled test set
accuracy_score(y_test_under, y_pred_xgb)
cnf_matrix = confusion_matrix(y_test_under, y_pred_xgb)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(y_test_under, y_pred_xgb))
viz = FeatureImportances(dTree, relative=False, topn=10)
viz.fit(X_train_scaled_u, y_train_under)   # use the same scaled features the model was trained on
viz.show()
After exploring the original, SMOTE (oversampled), and down-sampled data with both Random Forest and gradient-boosting models, the team concluded that the down-sampled approach using a Random Forest did the best job of identifying companies that will go bankrupt.
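For reference, the SMOTE variant mentioned above follows the same pattern as the undersampled pipeline; a minimal sketch, assuming oversampling is applied to the scaled training split only so synthetic rows never leak into the test set:
# Sketch of the SMOTE (oversampled) run referenced above.
smote = SMOTE(random_state=1)
X_train_sm, y_train_sm = smote.fit_resample(X_train_scaled, y_train)
rf_sm = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
rf_sm = rf_sm.fit(X_train_sm, y_train_sm)
print(classification_report(y_test, rf_sm.predict(X_test_scaled)))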
There is a trade-off in model performance: optimizing bankruptcy identification comes at the cost of falsely identifying healthy companies as potential bankruptcies. A business decision should be made to determine the level of false positives the bank is willing to tolerate in order to minimize its bankruptcy exposure.
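One way to act on that business decision is to move the classification threshold instead of retraining; a sketch using the fitted random forest from above (the 0.35 cut-off is an illustrative value, not a recommendation):
# Trade recall against false positives by thresholding the predicted
# bankruptcy probability instead of using the default 0.5 cut-off.
probs = dTree.predict_proba(X_test_scaled_u)[:, 1]   # column order follows dTree.classes_, so [:, 1] is class '1' (bankrupt)
threshold = 0.35                                     # illustrative value; set per the bank's false-positive tolerance
y_pred_custom = np.where(probs >= threshold, '1', '0')
print(classification_report(y_test_under, y_pred_custom))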
Given the complexity of the class separation, more sophisticated classifiers should be evaluated to improve the ability to predict future bankruptcies. More research should also be done on the leading indicators the models flagged as significant, to further optimize investment returns.
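As one concrete next step, the GridSearchCV import above can be put to work tuning the boosting model toward recall; a hedged sketch (the parameter grid is illustrative, not a tuned result):
# Illustrative hyperparameter search; grid values are assumptions, not tuned results.
param_grid = {'n_estimators': [35, 100, 200],
              'max_depth': [3, 5],
              'learning_rate': [0.05, 0.1]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=1999),
                    param_grid, scoring='recall_macro', cv=5)
grid.fit(X_train_scaled_u, y_train_under)
print(grid.best_params_, grid.best_score_)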